The purpose of this tutorial is to provide a basic understanding of Probit Regression and its implementation in R, Python, Stata, and SAS, using the “Female Labor Force Participation” data set.

##Model Introduction

A probit regression is a version of the generalized linear model used to model dichotomous outcome variables. It uses the inverse standard normal distribution as a linear combination of the predictors. The binary outcome variable Y is assumed to have a Bernoulli distribution with parameter p (where the success probability is \(p \in (0,1)\)). Hence, the probit link function is \[probit(EY) = \Phi^{-1}(p_i) = \Phi^{-1}(P_i[Y=1]) = \sum_{k=0}^{k=n}\beta_kx_{ik}\] where \[\Phi=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\alpha+\beta X} exp(-\frac{1}{2}Z^2)dZ\].

##Dataset: Female Labor Participation

In this tutorial, we work on Mroz data set on female labor participation with 8 variables. The data covers a sample of 753 married white women aged between 30 and 60 collected in 1975. The original source for this data is Mroz, T.A. (1987). “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55, 765-799. The subset of the original data set used in this study can be found here. The description of the variables can be found below.

Variable(s) Description Type
lfp Labor-force participation of the married white woman Categorical: 0/1
k5 Number of children younger than 6 years old Positive integer
k618 Number of children aged 6-18 Positive integer
age Age in years Positive integer
wc Wife`s college attendance Categorical: 0/1
hc Husband`s college attendance Categorical: 0/1
lwg Log expected wage rate for women in the labor force Numerical
inc Family income without the wife`s income Numerical

The goal of the tutorial is to identify whether certain characteristics of a woman’s household and personal life can predict her labor-force participation.

##Languages {.tabset}

###R

1. Data Summary

First, we load data into R by using read.csv given that the data is in a comma-separated format.

mroz = read.csv('./mroz.csv')

This is what the first six rows look like in R:

head(mroz)
##   X lfp k5 k618 age  wc hc       lwg    inc
## 1 1 yes  1    0  32  no no 1.2101647 10.910
## 2 2 yes  0    2  30  no no 0.3285041 19.500
## 3 3 yes  1    3  35  no no 1.5141279 12.040
## 4 4 yes  0    3  34  no no 0.0921151  6.800
## 5 5 yes  1    2  31 yes no 1.5242802 20.100
## 6 6 yes  0    0  54  no no 1.5564855  9.859

Then, we change all binary variables to be numeric.

mroz$lfp = ifelse(mroz$lfp=="yes", 1, 0)
mroz$wc = ifelse(mroz$wc=="yes", 1, 0)
mroz$hc = ifelse(mroz$hc=="yes", 1, 0)

Here is the quick summary of the data.

summary(mroz)
##        X            lfp               k5              k618      
##  Min.   :  1   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:189   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :377   Median :1.0000   Median :0.0000   Median :1.000  
##  Mean   :377   Mean   :0.5684   Mean   :0.2377   Mean   :1.353  
##  3rd Qu.:565   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:2.000  
##  Max.   :753   Max.   :1.0000   Max.   :3.0000   Max.   :8.000  
##       age              wc               hc              lwg         
##  Min.   :30.00   Min.   :0.0000   Min.   :0.0000   Min.   :-2.0541  
##  1st Qu.:36.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.8181  
##  Median :43.00   Median :0.0000   Median :0.0000   Median : 1.0684  
##  Mean   :42.54   Mean   :0.2815   Mean   :0.3918   Mean   : 1.0971  
##  3rd Qu.:49.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 1.3997  
##  Max.   :60.00   Max.   :1.0000   Max.   :1.0000   Max.   : 3.2189  
##       inc        
##  Min.   :-0.029  
##  1st Qu.:13.025  
##  Median :17.700  
##  Mean   :20.129  
##  3rd Qu.:24.466  
##  Max.   :96.000

2. Fitting model by Probit Regression

Next, we run a probit regression using lfp as a response variable and all the remaining variables as predictors.

mroz.probit <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc, 
                  data = mroz,
                  family = binomial(link = "probit"))
summary(mroz.probit)
## 
## Call:
## glm(formula = lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "probit"), 
##     data = mroz)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1359  -1.1024   0.5967   0.9746   2.2236  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.918418   0.382356   5.017 5.24e-07 ***
## k5          -0.874712   0.114423  -7.645 2.10e-14 ***
## k618        -0.038595   0.040950  -0.942 0.345942    
## age         -0.037824   0.007605  -4.973 6.58e-07 ***
## wc           0.488310   0.136731   3.571 0.000355 ***
## hc           0.057172   0.124207   0.460 0.645306    
## lwg          0.365635   0.089992   4.063 4.85e-05 ***
## inc         -0.020525   0.004852  -4.230 2.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  905.39  on 745  degrees of freedom
## AIC: 921.39
## 
## Number of Fisher Scoring iterations: 4

We can also obtain confidence intervals for the coefficient estimates.

confint(mroz.probit)
##                   2.5 %      97.5 %
## (Intercept)  1.17740788  2.66999634
## k5          -1.10050075 -0.65525817
## k618        -0.11806899  0.04069610
## age         -0.05283742 -0.02300114
## wc           0.22421495  0.75547759
## hc          -0.18586158  0.30035566
## lwg          0.19375227  0.53805204
## inc         -0.02999733 -0.01126605

3. Marginal effect

We first make a cross-tab of categorical predictors with our binary response variable and calculate adjusted predictions of lfp for two levels of hc and wc variables.

addmargins(table(mroz$lfp, mroz$hc, deparse.level=2))
##         mroz$hc
## mroz$lfp   0   1 Sum
##      0   207 118 325
##      1   251 177 428
##      Sum 458 295 753
hc_data0 = data.frame(k5 = mean(mroz$k5), k618 = mean(mroz$k618), age=mean(mroz$age), 
                      lwg= mean(mroz$lwg), inc=mean(mroz$inc), hc=0, wc=mean(mroz$wc))
hc_data1 = data.frame(k5 = mean(mroz$k5), k618 = mean(mroz$k618), age=mean(mroz$age), 
                      lwg= mean(mroz$lwg), inc=mean(mroz$inc), hc=1, wc=mean(mroz$wc))

h0 = predict(mroz.probit, hc_data0, type="response", se=TRUE)
h1 = predict(mroz.probit, hc_data1, type="response", se=TRUE)

hc_fit = data.frame(Margin = c(h0$fit[1], h1$fit[1]), se=c(h0$se.fit[1], h1$se.fit[1]))
hc_fit
##      Margin         se
## 1 0.5693807 0.02726260
## 2 0.5917191 0.03463592
addmargins(table(mroz$lfp, mroz$wc, deparse.level=2))
##         mroz$wc
## mroz$lfp   0   1 Sum
##      0   257  68 325
##      1   284 144 428
##      Sum 541 212 753
wc_data0 = data.frame(k5 = mean(mroz$k5), k618 = mean(mroz$k618), age=mean(mroz$age), 
                      lwg= mean(mroz$lwg), inc=mean(mroz$inc), hc = mean(mroz$hc), wc=0)
wc_data1 = data.frame(k5 = mean(mroz$k5), k618 = mean(mroz$k618), age=mean(mroz$age), 
                      lwg= mean(mroz$lwg), inc=mean(mroz$inc), hc = mean(mroz$hc), wc=1)

w0 = predict(mroz.probit, wc_data0, type="response", se=TRUE)
w1 = predict(mroz.probit, wc_data1, type="response", se=TRUE)

wc_fit = data.frame(Margin = c(w0$fit[1], w1$fit[1]), se=c(w0$se.fit[1], w1$se.fit[1]))
wc_fit
##      Margin         se
## 1 0.5238094 0.02402095
## 2 0.7081631 0.03847631
wc_data0_age=data.frame(k5=rep(mean(mroz$k5), 4), k618=rep(mean(mroz$k618), 4), 
                        age=c(30, 40, 50, 60), lwg=rep(mean(mroz$lwg), 4), 
                        inc=rep(mean(mroz$inc), 4), wc=rep(0, 4), 
                        hc=rep(mean(mroz$hc), 4))

wc_data1_age=data.frame(k5=rep(mean(mroz$k5), 4), k618=rep(mean(mroz$k618), 4), 
                        age=c(30, 40, 50, 60), lwg=rep(mean(mroz$lwg), 4), 
                        inc=rep(mean(mroz$inc), 4), wc=rep(1, 4), 
                        hc=rep(mean(mroz$hc), 4))

m0=predict(mroz.probit, wc_data0_age, type="response", se=TRUE)
m1=predict(mroz.probit, wc_data1_age, type="response", se=TRUE)

wc_fit_age = data.frame(Margin_wc0=m0$fit, Margin_wc1=m1$fit, se_wc0=m0$se.fit, se_wc1=m1$se.fit)
wc_fit_age
##   Margin_wc0 Margin_wc1     se_wc0     se_wc1
## 1  0.7033092  0.8466691 0.03936089 0.03561642
## 2  0.5618680  0.7402177 0.02509639 0.03716213
## 3  0.4119514  0.6047963 0.03193009 0.04744612
## 4  0.2739989  0.4552319 0.04823869 0.06726396
k_data=data.frame(k5=c(0,1,2,3), k618=rep(mean(mroz$k618), 4), age=rep(mean(mroz$age), 4), 
                  lwg=rep(mean(mroz$lwg), 4), inc=rep(mean(mroz$inc), 4), 
                  wc = rep(mean(mroz$wc), 4), hc=rep(mean(mroz$hc), 4))

k0=predict(mroz.probit, k_data, type="response", se=TRUE)

k_fit = data.frame(Margin_k=k0$fit, se_k=k0$se.fit)
k_fit
##     Margin_k       se_k
## 1 0.65730850 0.02052209
## 2 0.31932620 0.03563207
## 3 0.08942632 0.03349808
## 4 0.01324307 0.01087202
library(effects)
all.effects <- allEffects(mod = mroz.probit)
plot(all.effects, type="response", ylim=c(0,1))

###Python

####Loading Data

The data set has binary response variable lfp which stands for labor force participation

First we start by loading the data into memory using pandas package. We also load some relevant packages used in this analysis.

import pandas as pd
import numpy as np
from statsmodels.discrete.discrete_model import Probit
data = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Mroz.csv')
print(data.head())

This looks like

   Unnamed: 0  lfp  k5  k618  age   wc  hc       lwg        inc
0           1  yes   1     0   32   no  no  1.210165  10.910001
1           2  yes   0     2   30   no  no  0.328504  19.500000
2           3  yes   1     3   35   no  no  1.514128  12.039999
3           4  yes   0     3   34   no  no  0.092115   6.800000
4           5  yes   1     2   31  yes  no  1.524280  20.100000

####Data Cleaning

Since some data is read in as strings we can transform them into binary categorical data using the following command. We also drop the first column as it is read in with row numbers, which we do not need.

data = data.drop(data.columns[0], axis = 1)
data["lfp"] = data["lfp"] == "yes"
data["wc"] = data["wc"] == "yes"
data["hc"] = data["hc"] == "yes"

Looking at the data again we see:

    lfp  k5  k618  age     wc     hc       lwg        inc
0  True   1     0   32  False  False  1.210165  10.910001
1  True   0     2   30  False  False  0.328504  19.500000
2  True   1     3   35  False  False  1.514128  12.039999
3  True   0     3   34  False  False  0.092115   6.800000
4  True   1     2   31   True  False  1.524280  20.100000

####Summary Statistics

To generate some summary statistics we can use the functions describe on a data frame.

print(data.describe())

This generates summary statistics for the continuous variables in our dataset.

               k5        k618         age         lwg         inc
count  753.000000  753.000000  753.000000  753.000000  753.000000
mean     0.237716    1.353254   42.537849    1.097115   20.128965
std      0.523959    1.319874    8.072574    0.587556   11.634799
min      0.000000    0.000000   30.000000   -2.054124   -0.029000
25%      0.000000    0.000000   36.000000    0.818086   13.025000
50%      0.000000    1.000000   43.000000    1.068403   17.700001
75%      0.000000    2.000000   49.000000    1.399717   24.466000
max      3.000000    8.000000   60.000000    3.218876   96.000000

####Fitting Regression

First we break our dataset into response variable and predictor variables. Then we use the statsmodels function to fit our Probit regression.

Y = data["lfp"]
X = data.drop(["lfp"], 1)
model = Probit(Y, X.astype(float))
result = model.fit()
print(result.summary())

The following is the results of our regression

Optimization terminated successfully.
         Current function value: 0.618620
         Iterations 5
                          Probit Regression Results                           
==============================================================================
Dep. Variable:                    lfp   No. Observations:                  753
Model:                         Probit   Df Residuals:                      746
Method:                           MLE   Df Model:                            6
Date:                Mon, 26 Nov 2018   Pseudo R-squ.:                 0.09527
Time:                        23:16:26   Log-Likelihood:                -465.82
converged:                       True   LL-Null:                       -514.87
                                        LLR p-value:                 6.234e-19
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
k5            -0.6136      0.098     -6.249      0.000      -0.806      -0.421
k618           0.0674      0.034      1.965      0.049       0.000       0.135
age           -0.0021      0.003     -0.775      0.439      -0.007       0.003
wc             0.4497      0.133      3.372      0.001       0.188       0.711
hc             0.1267      0.122      1.040      0.298      -0.112       0.365
lwg            0.4632      0.084      5.486      0.000       0.298       0.629
inc           -0.0187      0.005     -3.983      0.000      -0.028      -0.010
==============================================================================

###Stata

1.Data Summary

Firstly, We import the Mroz data from website and show the first six rows of the dataset.

*Importing data 
import delimited https://vincentarelbundock.github.io/Rdatasets/csv/carData/Mroz.csv, clear
save mroz,replace
use mroz,clear
*List the first six rows 
list if v1<=6

Then, We change all binary variables to be numeric, and we get a summary of the data. Our response is lfp and its mean is 0.57. The range of age is from 30 to 60.

*Change variables with values yes/no to 1/0
gen lfpart =1 if lfp == "yes"
replace lfpart =0 if lfp == "no"
gen wifec =1 if wc == "yes"
replace wifec =0 if wc == "no"
gen husbc =1 if hc == "yes"
replace husbc =0 if hc == "no"
drop lfp wc hc
rename lfpart lfp
rename wifec wc
rename husbc hc
*Get the summary of the data
summ

2.Fitting model by Probit Regression

Now, we fit our data by probit regression. lfp is the response and the remaining variables are predictors. Looking at the p-values, all variables have highly significant, except k618 and hc.

*Fitting the data by probit regression
probit lfp k5 k618 age lwg inc i.wc i.hc

We get a summary of the probit prediction from the fitted model, we get the smallest probability is
0.005691 and the largest probability is 0.9745. The 50% percentile is 0.5782336, which is close to its mean we showed above.

*Predicting the probability of labor-force  participation
predict prob_lfp
summ prob_lfp, detail

3.Marginal effect

Now, we predict the data for groups defined by levels of categorical variables.

Group by hc

First, we make a table of frequently count of hc and lfp we predict the lfp for two groups: hc=0 and hc=1, and we keep other variables at mean.

tab lfp hc

*use margins for each level of hc
margins hc, atmeans

The marginal probability of hc=1 (husband has attained college) is 0.59 and it slightly higher than the marginal probability of hc=0 (husband has not attained college), which is 0.57. There is not obivious differnce. It is reasonable because the p-value of hc is very high.

Group by wc

The table of frequently shows that when wc=0, the proportion of lfp is average, which is closed to 0.5. However, when wc=1, the proportion of lfp=1 is much higher.

tab lfp wc

we predict the lfp for two groups: wc=0 and wc=1, and we keep other variables at mean.

*use margins for each level of wc
margins wc, atmeans

The result shows that the marginal probability is 0.71 when wc=1 and the marginal probability is 0.52 when wc=0. The probability of participating labor-force is higher when wife has attended college. We can say that wife’s college attendance is an important predictor.

We can go deeper on the predictor wc. We predict lfp for group by age and wc. Age is at every 10 years of age from 30 to 60. Since the output of marginal function is long, we make a plot to visualize the output and it is easier to interpert.

*use margins for each level of wc and age
margins, at(age=(30(10)60) wc=(0 1)) atmeans vsquish

marginsplot

From the marginal plot, we can conclude that when age is increasing, the probability is decreasing. Also, The probability of wc=1 is always higher than wx=0. At age 60, the variablity is the highest because the 95% confidence interval is the widest.

Group by k5

The table of frequently shows that the proportion of lfp is decreasing when k5 is increasing.

tab lfp k5

we predict the lfp by k5= 0 1 2 3, and we keep other variables at mean. Also, we make a plot to visualize the data.

*use margins for each level of k5
margins, at(k5=(0 1 2 3)) atmeans

marginsplot

The output shows that when women do not have any children 5 years old or younger, the probability of participating labor-force is 0.66 which is higher than the average. However, after they had childrens, the probability of participating labor-force is decreasing. Therefore, we can conclude that k5 is a significant predictor.

###SAS

1.Data Summary

Since for the original mroz.csv dataset, there are three covariates whose variable type are binary charater (“yes/no”): lfp, wc, hc. In this part, a processed data file would be used, where the values of the three character type variables are transfered from “yes/no” to numeric 1/0.

/*Importing data*/
proc import datafile = 'C:\Users\lanshi\Desktop\Mroz.csv'
  out = work.mroz
  dbms = CSV
  ;
run;

Display the summary table of the dataset (Seperate the 3 binary variables and the others).

/*For the other non-binary variables*/
proc means data=mroz;
 var k5 k618 age lwg inc;
run;

/*For those 3 binary variables, consider individuals and interactions between reponse and wc/hc.*/
proc freq data=mroz;
 tables lfp wc hc lfp*wc lfp*hc;
run;

2.Fitting model by Probit Regression

16bbf44cc96e1399eac40d66dc663801bc230857 Now, we fit our data by probit regression. lfp is the response and the remaining variables are predictors. By adding argument “descending”, we would be able to model 1s rather than 0s, which means predicting the probability of woman getting into label force (lfp=1) versus not getting in (lfp=0).

/*Fitting the data by probit regression*/
proc logistic data=mroz descending;
  class lfp wc hc / param=ref ;
  model lfp = k5 k618 age lwg inc wc hc /link=probit;
run;

Here are the results we get from the regression: The result of the Global Null Hypothesis (by Likelihood Ratio, Score and Wald test) indicates that the model is statistically significant.

As for each variable in the model, it is shown by the above two tables that: variable k618 and hc both have p-value greater than 0.05, thus under significant level alpha being 0.05, they are not statistically significant. The reason for k618 might be that kids of 6-18 are much more independent than kids less than 6, and their mom could have time to work. Regarding hc, whether husband has attended college won’t really affect their wife’s decision on whether being in the labor force or not.

Detailed Interpretations: (Definition of z-score: the probit regression coefficients give the change in the probit index, also called a z-score, for a one unit increase in the predictor variable.) • k5: For each one more kid less than 6 years old, the z-score decreases by 0.8747. • k618: For each one more kid of 6-18 years old, the z-score decreases by 0.0386. • age: For one year increase in a woman’s age, the z-score decreases by 0.0378. • lwg: For a one unit increase in log(wage), the z-score increases by 0.3656. • inc: For a one unit increase in (family income – wage*hours)/1000, the z-score decreases by 0.0205. • wc: Wife having attended college would increases the z-score by 0.4883. • hc: Husband having attended college would increases the z-score by 0.0572.

3.Marginal effect

Now, we predict the data for groups defined by levels of categorical variables. Keeping other variables at mean.

/*Fitting the data by probit regression*/
proc logistic data=mroz descending plots=EFFECT;
  class lfp wc hc / param=ref ;
  model lfp = k5 k618 age lwg inc wc hc /link=probit;
  output out=estimated predicted=estprob l=lower95 u=upper95;
run;

The code above also creates several plots for model diagnosis:

We can see from these diagonosis plots, the regression model’s performance is not bad.

##Conclusion

##References